This document provides supplementary description of the analysis presented in Grafmiller & Szmrecsanyi (2018) “Mapping out particle placement in Englishes around the world. A case study in comparative sociolinguistic analysis” (henceforth G&S). It contains further discussion of the motivations behind our methods, along with details of the statistical analyses reported in G&S.

The complete dataset, along with documentation, the annotation manual, and code for the analysis, can be freely downloaded at the following link.

Open Science repository: https://osf.io/x8vyw/.

1 The data

We investigate particle placement in nine varieties of English (Table 1) using data from two corpora, the International Corpus of English (ICE) and the Global Corpus of Web-based English (GloWbE). We examine four varieties belonging to the Inner Circle (Kachru 1985) of world Englishes (GB, CA, NZ, and IE), whose speakers use English as a first or native language (ENL). We also examine five Outer Circle varieties, whose speakers use English as a second language (ESL). We acknowledge that the ENL/ESL division elides a great deal of inter- and intra-varietal complexity and variation, and we adopt these labels mainly as terms of convenience (albeit ones with some degree of theoretical validity) for the description of our results. As we discuss below and in G&S, we find quite robust differences between exactly these two variety groups.

Table 1: Nine varieties sampled in G&S

GB = Great Britain; CA = Canada; IE = Ireland
NZ = New Zealand; JA = Jamaica; SG = Singapore
HK = Hong Kong; IN = India; PH = Philippines

The distribution of the PV variants across the varieties and corpora is shown in Figure 1.

Figure 1: Distribution of PV variants by corpus and variety. Numbers represent the number of observed variants and bars represent the proportion of all tokens in the respective corpus and variety


The extraction and annotation of PVs is summarized in G&S, and detailed extensively in the annotation manual linked to above. We leave those matters to interested readers and focus here on the motivations and methods of our analyses.

2 Methodological considerations

The modern standard approach to statistical analysis of sociolinguistic variables, i.e. generalized linear mixed modeling (GLMM), can provide insight into the extent to which the effects of different predictors on particle placement vary across the nine varieties studied here. GLMMs offer important advantages over older methods, e.g. Varbrul, as GLMMs are explicitly designed to incorporate variability at multiple levels of data structure, such as verb-specific preferences, which we can model through the inclusion of random effects terms.

Another asset of GLMMs is that we can use them to directly test hypotheses about external predictors, e.g. Variety, and their interactions with internal predictors, e.g. direct object length or concreteness. However, when using such models in corpus-based research it is important to note that these hypotheses must be specified a priori, i.e. in the model formula supplied to the fitting procedure. In practice, corpus-based variationist research typically proceeds by testing for as many interactions as possible, and then focusing on those that pass some significance threshold, usually p < .05 (see e.g. Bresnan & Hay 2008; Szmrecsanyi et al. 2016; Szmrecsanyi et al. 2017; Wolk et al. 2013). To simplify computation and interpretation, models are often subjected to some process of model selection, whereby predictors and interactions are added or removed if they fail to meet some predetermined criterion. This practice is controversial, however, as it has been shown to inflate individual predictors’ effect sizes and significance (e.g. Harrell 2001; Johnson 2010). A newer method, model averaging (Burnham & Anderson 2002), has been used with some success in linguistics (Barth & Kapatsinski 2014; Kuperman & Bresnan 2012), but it is not without its own issues (see Cade 2015).

So we must begin our analysis by asking: what should our (initial) model structure be? Fitting models with maximal random-effects structures (Barr et al. 2013) and multiple fixed-effects interactions unfortunately proved infeasible given the complexity of our full dataset. We were unable to fit even a minimally adequate GLMM limited to only 2-way interactions between Variety and the other fixed-effects predictors. Additionally, models with complex random-effects structures, e.g. by-verb or by-particle random slopes for Variety, consistently failed to converge. Taking a cue from recent corpus-based studies of the same stripe (e.g. Bernaisch et al. 2014; Deshors & Gries 2016; Szmrecsanyi et al. 2016), we limited our investigation of the full dataset to the more tractable random forest approach.

We first present the results of our supplementary random forest analysis, before turning to the by-variety comparative analysis discussed in G&S.

3 Random forest analysis

The random forest method (see Breiman 2001; Strobl, Malley & Tutz 2009) involves the creation of many hundreds or even thousands of decision tree models based on random subsamples of both the data and predictors. For a given observation, each tree “votes” for a given outcome, and the outcome with the most votes wins. Due to the random sampling process, random forests are quite accurate, and are robust to statistical issues common to studies of observational linguistic data, e.g. data sparseness and predictor nonlinearities (Matsuki, Kuperman & Van Dyke 2016). Most standard statistical packages also provide measures of predictor importance that are much less affected by data multicollinearity (see e.g. Liaw & Wiener 2002; Strobl, Malley & Tutz 2009). For these reasons, the random forest method is increasingly becoming a standard component of the variationist analyst’s toolkit (Baayen 2013; Bernaisch, Gries & Mukherjee 2014; Deshors & Gries 2016; Szmrecsanyi et al. 2016; Tagliamonte & Baayen 2012).
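The voting procedure can be sketched in a few lines. The example below is an illustrative Python sketch using scikit-learn on synthetic data (the paper's own models are fit with R's party package); it simply tallies each tree's hard vote and takes the majority.

```python
# Illustrative sketch of random forest "voting" (scikit-learn, synthetic
# data standing in for split vs. continuous PV tokens); NOT the
# conditional inference forest (party::cforest) used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=1)

# Each tree sees a bootstrap sample of the data and a random subset of
# predictors at each split (max_features).
forest = RandomForestClassifier(n_estimators=100, max_features=3,
                                random_state=1)
forest.fit(X, y)

# Reproduce the voting by tallying the individual trees' predictions;
# this usually agrees with forest.predict(), which averages probabilities.
votes = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
majority = (votes.mean(axis=0) > 0.5).astype(int)
```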

Here we present in detail the results of a supplementary random forest model fit to the entire dataset. The analysis suggests that there is a considerable degree of stability in the effects of internal constraints on particle placement across varieties of English, and where we do find regional variation in our predictors, such variation largely takes the shape of an ENL vs. ESL/EFL divide. We find some evidence of cross-varietal interactions with internal predictors, though these internal interaction effects appear less important than the effects of the functional and/or stylistic context. Split PVs are more likely in spoken and/or interpersonal language, and this tendency is much stronger in ENL than in ESL varieties (Figure 3).

We also find solid evidence for the overall influence of several internal predictors in the random forests model (Figure 2). As expected, use of the split variant decreases as the length of the direct object increases, while it increases with compositional PVs and PVs where the surprisal of the particle given the verb is highest, which we interpret as a possible measure of the semantic independence of the verb. In addition, we find that the probability of the split variant increases when a post-modifying PP is present, and when the direct object is given, definite, and/or concrete. These patterns are all in accordance with prior research.

3.1 Fitting the model

The forest model was fit using the cforest() function in R’s party package (Hothorn, Hornik & Zeileis 2006; Strobl et al. 2007; Strobl et al. 2008). This package uses conditional inference splitting methods (Hothorn, Hornik & Zeileis 2006) for growing trees, rather than the impurity reduction metrics (e.g. Gini, Entropy) common to the classification tree algorithms used in other packages such as randomForest and ranger. See Tagliamonte & Baayen (2012) for an introduction.

The random forest model formula is shown below.

Response ~ Variety + Genre + DirObjWordLength + Semantics + 
  DirObjConcreteness + DirObjGivenness + DirObjDefiniteness +
  DirObjThematicity + DirectionalPP + CV.binary + Surprisal.P +
  Surprisal.V + Rhythm + PrimeType

The forest was grown on 1500 trees (ntree = 1500) sampling 4 predictors at each node (mtry = 4). Model controls and hyperparameters were otherwise set to the function defaults.

The model fits the data well (Table 2), predicting particle placement significantly better than chance (pbinom \(\approx\) 0).
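The better-than-chance comparison amounts to a one-sided binomial test of the number of correctly predicted tokens against a baseline accuracy. A Python sketch with made-up counts (not the paper's actual figures, which the R pbinom() call is computed from):

```python
# One-sided binomial test of classification accuracy against a baseline
# (hypothetical counts; baseline taken as the majority-class proportion).
from scipy.stats import binomtest

n_tokens = 5000    # hypothetical number of tokens
n_correct = 4400   # hypothetical number of correct forest predictions
baseline = 0.70    # hypothetical proportion of the majority variant

result = binomtest(n_correct, n_tokens, p=baseline, alternative='greater')
# result.pvalue is effectively 0 when accuracy far exceeds the baseline
```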

Table 2: Fit statistics for random forest model

3.2 Variable importance

For evaluating the relative importance of our predictors, we use a method based on the area under the curve (AUC), which is equivalent to the index of concordance C. This method is less biased for strongly unbalanced data, i.e. data where the sizes of the response classes differ considerably (Janitza, Strobl & Boulesteix 2013).
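The logic of AUC-based permutation importance can be sketched as follows: permute one predictor's values to break its link with the response, and record the resulting drop in AUC. This Python sketch uses scikit-learn on synthetic data (the party implementation works on out-of-bag cases; for brevity this version scores on the training data).

```python
# Sketch of AUC-based permutation importance (after Janitza et al. 2013):
# importance of predictor j = AUC drop after permuting column j.
# Synthetic data and scikit-learn stand in for the paper's cforest model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           random_state=2)
model = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, y)

rng = np.random.default_rng(2)
base_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    rng.shuffle(X_perm[:, j])   # break the predictor-response association
    perm_auc = roc_auc_score(y, model.predict_proba(X_perm)[:, 1])
    importances.append(base_auc - perm_auc)   # AUC drop = importance
```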

Figure 2: Variable importance measures of each predictor in random forest model

We find that across the dataset as a whole, Surprisal.P and the length of the direct object play the most important roles in predicting particle placement. Other internal constraints, such as semantics and the presence of a directional PP, are somewhat weaker. Surprisingly, the accessibility-related constraints such as DirObjDefiniteness, DirObjThematicity, and DirObjGivenness have relatively little explanatory power in our model, contrary to findings elsewhere (e.g. Dehé 2002; Gries 2003; Haddican & Johnson 2012). The particularly weak effect of givenness may be attributed to our inclusion of other predictors not normally considered, i.e. Thematicity and Surprisal, which could be capturing much more fine-grained information than our binary givenness measure. It could also be the result of our automatic coding procedure for givenness, which is suboptimal for fully capturing information status. This remains an area in need of further investigation.

More pertinent to our study are the relatively high-ranking effects of the external predictors Genre and Variety. There is considerable variability in particle placement across both genres and varieties, and the random forest model shows that this variability is not simply reducible to influences of the internal linguistic predictors we consider here. This amount of regional and stylistic variability in particle placement has to date been underappreciated (though see Haddican & Johnson 2012; Zipp & Bernaisch 2012).

One drawback of variable importance plots is that they give no information about the details of a given predictor’s influence, i.e. the direction of its effect, or how it interacts with other predictors. To investigate the predictors more closely, then, we turn to so-called partial dependency plots, which provide visualizations of (sets of) predictors’ effects on particle placement.

3.3 Partial dependency plots

To examine the effects of an individual predictor (and its relation to other predictors), we plot the partial dependence of the probability of the split PV variant on that predictor or predictors, averaging over the values of the other predictors. This method is similar in spirit to partial effects plots derived from regression models, though it should not be interpreted as a significance test.
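The averaging step can be made concrete with a minimal sketch: hold the focal predictor fixed at each value on a grid, and average the model's predicted probability over the observed values of all other predictors. The model and data below are synthetic placeholders, not the paper's.

```python
# Minimal sketch of partial dependence for a binary classifier:
# fix one predictor at each grid value, average predictions over the rest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=3)
model = RandomForestClassifier(n_estimators=100, random_state=3).fit(X, y)

def partial_dependence(model, X, feature, grid):
    pd = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value              # hold focal predictor fixed
        pd.append(model.predict_proba(X_mod)[:, 1].mean())  # average out the rest
    return np.array(pd)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
curve = partial_dependence(model, X, 0, grid)  # P(split) along the grid
```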

Since we are dealing with a binary response, we plot only the predicted probability of the split (V-O-P) variant in the plots below. Additionally, we are chiefly concerned with cross-varietal differences, thus we focus only on interactions of Variety with the other predictors and do not explore potential interactions among internal predictors, though we note that the random forest model does potentially capture such effects (though see Wright, Ziegler & König 2016).

Looking at the interaction between Genre and Variety (Figure 3), it is clear that the biggest stylistic shifts in the use of PVs occur in the ENL varieties, where the split variant is far more frequent in less formal interpersonal genres, i.e. spoken dialogues, personal letters and creative writing. The patterns across the genres are largely the same for all varieties, e.g. unscripted monologues appear to feature some of the highest use of split PVs everywhere; however, the differences between genres are far more pronounced in the ENL varieties than in the ESL varieties.

Figure 3: Partial dependency plot of Genre by Variety

Note also that the stylistic trends cross-cut modality, as the relatively formal scripted monologues (“ScriptedMono”) show far fewer uses of the split variant. These spoken texts are more similar to persuasive and instructional writing than to other kinds of spoken language. Lastly we note that the two web-based genres, “Blog” and “GeneralWeb”, consistently show the lowest use of split PVs in all varieties. While it is true that some of the ICE data is 10-20 years older than that in GloWbE, it seems unlikely that all varieties have undergone an identical shift in PV usage over the past 20 years or so (but see Tagliamonte, D’Arcy & Louro 2016), especially given evidence for recent changes in the opposite direction (Haddican & Johnson 2012). We suspect therefore that this difference in GloWbE is due to differences in medium (traditional vs online writing), differences in the compilation methods (see discussion in Davies & Fuchs 2015 and replies), or to other stylistic factors related to these different usage contexts.

Turning to the cross-varietal patterns in the internal constraints (Figures 4 & 5), a couple of generalizations can be made. First, it appears that the internal constraints all behave as predicted by the literature, and this applies to all varieties. That is, the effects all go in the hypothesized direction, e.g. longer direct objects disfavor the use of the split variant while the presence of a directional PP favors it. This includes the effects of surprisal (Figure 4), where we predicted that less surprising, i.e. more predictable, verb-particle combinations would be biased toward the continuous variant. Language users are more likely to choose the continuous variant with verb-particle pairs that co-occur very frequently. We thus interpret surprisal essentially as a measure of the degree of syntactic/semantic co-dependency or compositionality.

Figure 4: Partial dependency plot of surprisal measures by Variety

Again the largest difference in these plots is between the ENL and ESL varieties, and it would seem that this distinction alone is what underlies the high importance ranking assigned to Variety in Figure 2. Effects of the internal predictors are largely parallel across all varieties, with the possible exception of Semantics, which appears to have a slightly stronger effect in the ENL varieties than in the others.

Figure 5: Partial dependency plots of predictors by Variety

4 By-Variety Comparative Analysis

4.1 Adapting the Comparative Sociolinguistic method

As discussed in G&S, the comparative sociolinguistic method involves the evaluation of three ‘lines of evidence’ derived from separate models fit to different datasets, and these lines of evidence represent natural applications of regression modeling (Poplack & Tagliamonte 2001; Tagliamonte & Baayen 2012; Tagliamonte 2013). A first line of evidence involves ranking the different (sets of) constraints, or factor groups, according to the strength of their influence on the variable: Do the constraints have the same relative ranking across the datasets? The second line of evidence, statistical significance, involves a comparison of the constraints that are determined by the model to have a significant effect on the variable: Do the models of the different datasets share the same set of significant constraints? The third line of evidence consists of comparing the magnitude and direction of the significant constraints across datasets: Are the directions of the constraints (for continuous or binary factors) the same; are the orders of the levels within a categorical constraint (the ‘constraint hierarchy’ in variationist terms) the same? Are the effects of some constraints stronger in one dataset than others?

While this method has clear intuitive appeal, there are a few technical concerns that must be considered (see also Claes 2016:85). One concern lies in the way constraints’ relative explanatory contributions (relative strength) are assessed. In Varbrul analyses, relative importance is measured as the range between the highest and lowest factor weight for a given factor group. Modern statistical tools such as R (or Rbrul) now allow for much greater flexibility in statistical modeling. For example, modern regression modeling tools allow for continuous as well as categorical predictors. However, the effect sizes of continuous and categorical predictors, as expressed in the regression coefficients, cannot be directly compared to one another unless certain transformations are applied to the model inputs (see e.g. Gelman et al. 2008). At the same time, predictors with many levels (factors) are more likely to have larger effect ranges simply by virtue of their having more levels; binary factors are thus more likely to be ranked lower by mere chance alone. Finally, effect size estimates of individual predictors in regression models are highly sensitive to correlations with other predictors, a problem known as (multi)collinearity. Unless care is taken to decorrelate predictors before model fitting, or to assess multicollinearity post-fitting, inferences about predictor importance based on coefficient estimates (ranges) can be quite unreliable.

A second problem lies in the (over)reliance on statistical significance as a threshold for (dis)similarity. Significance, as with effect size, can be highly sensitive to covariation among predictors. More importantly, comparing the significance of predictors across models fit to different datasets is ill-advised since statistical significance is dependent on sample size as well as the distribution of the predictor’s values in the data (Anderson, Burnham & Thompson 2000; Hubbard & Lindsay 2008), a problem that has also been acknowledged by comparative sociolinguists (Poplack & Tagliamonte 2001:93; Tagliamonte 2013:152, n.5). Given that probabilistic linguistic knowledge is derived from experience, it is highly unlikely that the effect of any predictor is exactly 0, which is the null hypothesis that significance tests typically assume. The same principle of course applies to interaction effects in a single model: it is doubtful that a given predictor’s effect is exactly the same in any given set of two or more unique populations. Given enough data, we could show that every constraint or interaction in every variety/dataset is significant, albeit with a very small effect size. But we are not aware of any model of grammar, usage-based or otherwise, that considers what a minimum effect size threshold should be for determining when two communities’ grammars (or individuals’ grammars for that matter) are different “enough”. In light of this, a technique that assumes a more nuanced approach to assessing (dis)similarity in probabilistic variation grammars seems desirable (see also Tagliamonte 2013).

4.2 Generalized linear mixed models

In this section we provide details of the procedure for creating and evaluating the GLMMs presented in G&S.

4.2.1 Model formulas

The fixed effects formula for each of our models is as follows.

Response ~ Register + DirObjWordLength + Semantics + DirObjConcreteness + 
    DirObjGivenness + DirObjDefiniteness + DirObjThematicity + 
    DirectionalPP + CV.binary + Surprisal.P + Surprisal.V + Rhythm + 
    PrimeType

The random effects formula for each of our models is as follows.

Response ~ FIXED_EFFECTS + (1 | Verb) + (1 | Particle) + (1 | VerbPart) + (1 | Genre)

The above model structure was fit to each dataset. No further testing (e.g. via likelihood ratio tests) was conducted to assess the random effects. Some models resulted in (near) singular fits, with some random-effects variances at or very near 0, but all terms were kept in the models to keep the structures identical.

4.2.2 Predictor coding and standardization

All model inputs were standardized following procedures recommended by Gelman & Pardoe (2007) and Gelman et al. (2008). Continuous predictor inputs were centered around zero by subtracting the mean from each value, and then scaled by dividing by 2 standard deviations. Binary predictors were converted to numeric values (0 or 1) and centered to have a mean of zero. No scaling was applied to binary inputs.
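The two standardization steps can be sketched as follows (a minimal Python illustration of the Gelman-style procedure described above, with toy input vectors; the actual inputs were standardized in R):

```python
# Gelman-style input standardization, as described above:
# continuous inputs: center and divide by 2 standard deviations;
# binary inputs: convert to 0/1 and center only (no scaling).
import numpy as np

def standardize_continuous(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2 * x.std())   # standardized inputs have SD 0.5

def center_binary(x):
    x = np.asarray(x, dtype=float)          # values assumed coded 0/1
    return x - x.mean()

# Toy inputs standing in for e.g. DirObjWordLength and DirObjGivenness
length = standardize_continuous([1, 2, 2, 3, 7, 4])
given = center_binary([0, 1, 1, 0, 1, 1])
```

Dividing by 2 SD (rather than 1) puts continuous coefficients on roughly the same scale as centered binary coefficients, which is what makes their effect sizes comparable.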

Multinomial predictors, Register and PrimeType, were not standardized. Register was sum coded, with each individual coefficient representing deviation from the mean across all register levels. PrimeType was treatment coded, with ‘none’, i.e. no prior PV present, as the reference level to which the other two levels, ‘split’ or ‘continuous’, were compared.
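The two coding schemes can be illustrated with a small hand-rolled sketch (the level names for Register here are hypothetical; PrimeType's levels are as described above):

```python
# Hand-rolled sketch of treatment vs. sum (deviation) coding.
import numpy as np

def treatment_code(values, levels):
    # One 0/1 column per non-reference level; the reference level
    # (first in `levels`) is coded as all zeros.
    ref, *others = levels
    return np.array([[1.0 if v == lvl else 0.0 for lvl in others]
                     for v in values])

def sum_code(values, levels):
    # One column per non-final level; the last level is coded -1 on every
    # column, so coefficients represent deviations from the grand mean.
    *others, last = levels
    return np.array([[-1.0 if v == last else (1.0 if v == lvl else 0.0)
                      for lvl in others] for v in values])

# PrimeType: 'none' as reference (as in the models above)
prime = treatment_code(['none', 'split', 'continuous'],
                       ['none', 'split', 'continuous'])

# Register: hypothetical three-level example with sum coding
reg = sum_code(['spoken', 'written', 'online'],
               ['spoken', 'written', 'online'])
```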

4.2.3 Model evaluation

Here we present various model fit diagnostics for each of our models. C represents the concordance statistic, Dxy is Somers’s \(D_{xy}\), and AICc is the Akaike Information Criterion corrected for sample size and number of model parameters (Burnham & Anderson 2002). The statistic kappa is a measure of data multicollinearity (Baayen, Davidson & Bates 2008).

4.2.3.1 Model fits

Table 1: Summary statistics for by-variety models

4.2.3.2 Overdispersion

We also check for overdispersion. That is, we check whether there is greater variability, i.e. statistical dispersion, than we would expect assuming that our response is binomially distributed.

We use an approximate estimate of overdispersion based on a chi-squared test of the sum of squared Pearson residuals with degrees of freedom equal to the residual degrees of freedom in the model. Significant deviation from a theoretical \(\chi^2\) distribution (\(p < .05\)) can be considered strong evidence of overdispersion.
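The test can be sketched in a few lines; the residuals and residual degrees of freedom below are made-up stand-ins for a fitted model's:

```python
# Approximate overdispersion check (GLMM FAQ approach): compare the sum of
# squared Pearson residuals to a chi-squared distribution on the residual df.
# Residuals here are simulated stand-ins for a fitted model's.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pearson_resid = rng.normal(size=2000)   # stand-in for Pearson residuals
rdf = 1985                              # residual df (n minus parameters)

chisq = float(np.sum(pearson_resid ** 2))
ratio = chisq / rdf                     # dispersion ratio; ~1 if well behaved
pval = stats.chi2.sf(chisq, df=rdf)     # small p => evidence of overdispersion
```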

For more details and R code, see http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion

Table 2: Overdispersion measures for by-variety models

4.2.3.3 Multicollinearity

The variance inflation factor (VIF) provides a measure of how much the variance in the estimate of a given predictor is increased due to that predictor being correlated with other predictors in the model. High VIFs suggest unreliable estimates. Opinions vary as to what should be considered a worrisomely high VIF, from as high as 10 to as low as 3 (see O’Brien 2007; Zuur et al. 2010), but we believe the VIFs in our models are within a reasonable range.
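The underlying computation is simple: regress each predictor on all the others and take 1 / (1 − R²). A Python sketch on a synthetic design matrix (not the paper's model inputs):

```python
# VIF from first principles: for each column j, regress it on the other
# columns (plus intercept) and compute 1 / (1 - R^2).
import numpy as np

def vif(X):
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])   # add intercept
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=300)   # induce collinearity
# vif(X): columns 0 and 3 show inflated values, the others stay near 1
```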

Table 3: Variance Inflation Factors for each model predictor


4.2.4 Random effects

Here we present details on the random effects for each model. For each variety, we provide plots of the distributions of the Best Linear Unbiased Predictors (BLUPs) for each group factor in the model. These represent the adjustments to the overall log odds (i.e. the model intercept) for each genre, verb, particle, and verb-particle pair. Gray dots represent levels that are indistinguishable from zero.

For interested readers, we also provide the estimated value for each group level, that is, each individual genre, verb, etc., in the tables following the plots. Positive values reflect greater bias for the split V-Object-P order, negative values reflect greater bias for the continuous V-P-Object order.

Results for each variety are presented in the subsections below.

4.2.4.1 Great Britain

Figure 1: Distribution of BLUPs for GB model

Table 3: Genre BLUPs for GB model.

Table 4: Particle BLUPs for GB model.

Table 5: Verb BLUPs for GB model.

Table 6: Verb-Particle pair BLUPs for GB model.


4.2.4.2 Canada

Figure 2: Distribution of BLUPs for CA model

Table 7: Genre BLUPs for CA model.

Table 8: Particle BLUPs for CA model.

Table 9: Verb BLUPs for CA model.

Table 10: Verb-Particle pair BLUPs for CA model.


4.2.4.3 New Zealand

Figure 3: Distribution of BLUPs for NZ model

Table 11: Genre BLUPs for NZ model.

Table 12: Particle BLUPs for NZ model.

Table 13: Verb BLUPs for NZ model.

Table 14: Verb-Particle pair BLUPs for NZ model.


4.2.4.4 Ireland

Figure 4: Distribution of BLUPs for IE model

Table 15: Genre BLUPs for IE model.

Table 16: Particle BLUPs for IE model.

Table 17: Verb BLUPs for IE model.

Table 18: Verb-Particle pair BLUPs for IE model.


4.2.4.5 Jamaica

Figure 5: Distribution of BLUPs for JA model

Table 19: Genre BLUPs for JA model.

Table 20: Particle BLUPs for JA model.

Table 21: Verb BLUPs for JA model.

Table 22: Verb-Particle pair BLUPs for JA model.


4.2.4.6 Singapore

Figure 6: Distribution of BLUPs for SG model

Table 23: Genre BLUPs for SG model.

Table 24: Particle BLUPs for SG model.

Table 25: Verb BLUPs for SG model.

Table 26: Verb-Particle pair BLUPs for SG model.


4.2.4.7 Hong Kong

Figure 7: Distribution of BLUPs for HK model

Table 27: Genre BLUPs for HK model.

Table 28: Particle BLUPs for HK model.

Table 29: Verb BLUPs for HK model.

Table 30: Verb-Particle pair BLUPs for HK model.


4.2.4.8 Philippines

Figure 8: Distribution of BLUPs for PH model

Table 31: Genre BLUPs for PH model.

Table 32: Particle BLUPs for PH model.

Table 33: Verb BLUPs for PH model.

Table 34: Verb-Particle pair BLUPs for PH model.


4.2.4.9 India

Figure 9: Distribution of BLUPs for IN model

Table 35: Genre BLUPs for IN model.

Table 36: Particle BLUPs for IN model.

Table 37: Verb BLUPs for IN model.

Table 38: Verb-Particle pair BLUPs for IN model.


4.3 Deriving Probabilistic Distances

4.3.1 Distance matrices

For the two lines of evidence derived from the GLMMs, we calculate two distance matrices, one derived from a table of the AICc-based constraint rankings across varieties (Table 39), and one from the table of the GLMM coefficients themselves (Table 40). These were calculated using a permutation procedure modelled after the permutation variable importance measures used in the random forest analysis (see also Baayen 2011).

Table 39: Variable importance rankings for by-variety GLMMs

Table 40: Coefficient estimates for by-variety GLMMs


Table 41: Spearman correlations of variable importance rankings for by-variety GLMMs

Table 42: Distance matrix based on variable importance rankings for by-variety GLMMs

To calculate a distance matrix reflecting differences in coefficients (Table 43) we use the Euclidean distance measure (see e.g., Aldenderfer & Blashfield 1984:25), which defines the distance between two varieties as the square root of the sum of all squared coefficient differentials.
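The definition above translates directly into code; here is a minimal Python sketch with toy coefficient vectors (not the fitted values from Table 40):

```python
# Euclidean distance between two varieties' coefficient vectors:
# the square root of the sum of all squared coefficient differentials.
import numpy as np

def coef_distance(beta_a, beta_b):
    diff = np.asarray(beta_a) - np.asarray(beta_b)
    return float(np.sqrt(np.sum(diff ** 2)))

# Toy coefficient vectors for two hypothetical varieties
beta_gb = [0.8, -1.2, 0.3]
beta_sg = [0.2, -0.5, 0.9]
d = coef_distance(beta_gb, beta_sg)   # sqrt(0.36 + 0.49 + 0.36) = 1.1
```

Applying this to every pair of varieties yields the full distance matrix in Table 43.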

Table 43: Distance matrix based on coefficient estimates for by-variety GLMMs

4.3.2 Visualization

To visualize probabilistic distances, G&S use NeighborNet diagrams created with the phangorn package in R (Schliep 2011). For details of the method we refer readers to the code (04_prob_distance_analysis.R) provided in the project repository, and for conceptual discussion and interpretation of the results we refer to the relevant sections in the main paper and the references cited therein.

Other methods can also be used to provide alternative visualizations, for example hierarchical cluster analysis (Figures 6 & 7), or multidimensional scaling (Figures 8 & 9).

Figure 6: Cluster diagram of inter-varietal distances based on the GLMM coefficients (effect size)

Figure 7: Cluster diagram of inter-varietal distances based on the constraint rankings

Figure 8: 3D multidimensional scaling map of inter-varietal distances based on the GLMM coefficients (effect size)

Alternative visualization techniques allow us to inspect the data from different angles, but the methods generally identify the same key patterns. We find clear Inner vs. Outer Circle groupings in the coefficient-based distances, with the position of Singapore English being somewhat in dispute. The picture is less clear with the ranking-based distances. The hierarchical clustering suggests a possible split between varieties with British vs. non-British orientations, as did the NeighborNet diagram, though the MDS plot suggests that these groupings may be less robust than the cluster model(s) imply. Considering that the distances are based on ranking correlations of relatively few data points, we find that conclusions drawn from this line of evidence should be treated with extra caution. Nevertheless, the two distance matrices do show a moderate degree of correlation (r = .58, p < 0.01), so we believe both methods are worth exploring.


References

Aldenderfer, Mark & Roger Blashfield. 1984. Cluster Analysis. Thousand Oaks, CA: SAGE Publications. doi:10.4135/9781412983648.

Anderson, David R., Kenneth P. Burnham & William L. Thompson. 2000. Null hypothesis testing: Problems, prevalence, and an alternative. The Journal of Wildlife Management 64(4). 912–923. doi:10.2307/3803199.

Baayen, R. H., D. J. Davidson & D. M. Bates. 2008. Mixed-Effects Modeling with Crossed Random Effects for Subjects and Items. Journal of Memory and Language 59. 390–412.

Baayen, R. Harald. 2011. Corpus Linguistics and Naive Discriminative Learning. Revista Brasileira de Linguística Aplicada 11(2). 295–328. doi:10.1590/S1984-63982011000200003.

Baayen, R. Harald. 2013. languageR: Data Sets and Functions with “Analyzing Linguistic Data: A Practical Introduction to Statistics”. R Package Version 1.4.1.

Barr, Dale J., Roger Levy, Christoph Scheepers & Harry J. Tily. 2013. Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal. Journal of Memory and Language 68. 255–278.

Barth, Danielle & Vsevolod Kapatsinski. 2014. A Multimodel Inference Approach to Categorical Variant Choice: Construction, Priming and Frequency Effects on the Choice between Full and Contracted Forms of Am, Are and Is. Corpus Linguistics and Linguistic Theory 0(0). doi:10.1515/cllt-2014-0022.

Bernaisch, Tobias, Stefan Th. Gries & Joybrato Mukherjee. 2014. The dative alternation in South Asian English(es): Modelling predictors and predicting prototypes. English World-Wide 35. 7–31.

Breiman, Leo. 2001. Random Forests. Machine Learning 41. 5–32.

Bresnan, Joan & Jennifer Hay. 2008. Gradient grammar: An effect of animacy on the syntax of Give in New Zealand and American English. Lingua 118. 245–259. doi:10.1016/j.lingua.2007.02.007.

Burnham, Kenneth P. & David R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York: Springer.

Cade, Brian S. 2015. Model averaging and muddled multimodel inferences. Ecology 96(9). 2370–2382. doi:10.1890/14-1639.1.

Claes, Jeroen. 2016. Cognitive, social, and individual constraints on linguistic variation: A case study of presentational haber pluralization in Caribbean Spanish (Cognitive Linguistics Research 60). Berlin & Boston: De Gruyter Mouton.

Davies, Mark & Robert Fuchs. 2015. Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-Based English Corpus (GloWbE). English World-Wide 36(1). 1–28. doi:10.1075/eww.36.1.01dav.

Dehé, Nicole. 2002. Particle Verbs in English: Syntax, Information Structure, and Intonation. Amsterdam: John Benjamins.

Deshors, Sandra C. & Stefan Th. Gries. 2016. Profiling verb complementation constructions across New Englishes: A two-step random forests analysis of ing vs. to complements. International Journal of Corpus Linguistics 21(2). 192–218. doi:10.1075/ijcl.21.2.03des.

Gelman, Andrew & Iain Pardoe. 2007. Average predictive comparisons for models with nonlinearity, interactions, and variance components. Sociological Methodology 37(1). 23–51. doi:10.1111/j.1467-9531.2007.00181.x.

Gelman, Andrew, Aleks Jakulin, Maria Grazia Pittau & Yu-Sung Su. 2008. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics 2(4). 1360–1383. doi:10.1214/08-AOAS191.

Gries, Stefan Th. 2003. Multifactorial Analysis in Corpus Linguistics: A Study of Particle Placement. New York: Continuum Press.

Haddican, Bill & Daniel Ezra Johnson. 2012. Effects on the particle verb alternation across English dialects. University of Pennsylvania Working Papers in Linguistics 18(1). 31–40.

Harrell, Frank E. 2001. Regression Modeling Strategies. New York: Springer.

Harrell, Frank E. 2015. Regression Modeling Strategies. 2nd ed. New York: Springer.

Hothorn, Torsten, Kurt Hornik & Achim Zeileis. 2006. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15(3). 651–674. doi:10.1198/106186006X133933.

Hubbard, Raymond & R. Murray Lindsay. 2008. Why p values are not a useful measure of evidence in statistical significance testing. Theory & Psychology 18(1). 69–88. doi:10.1177/0959354307086923.

Janitza, Silke, Carolin Strobl & Anne-Laure Boulesteix. 2013. An AUC-based permutation variable importance measure for random forests. BMC Bioinformatics 14(1). 119. doi:10/f4s4cx.

Johnson, Daniel Ezra. 2010. Why stepwise isn’t so wise. Paper presented at NWAV 39, San Antonio, TX.

Kachru, Braj B. 1985. Standards, codification and sociolinguistic realism: The English language in the Outer Circle. In Randolph Quirk & Henry G. Widdowson (eds.), English in the World: Teaching and Learning the Language and Literatures, 11–30. Cambridge: Cambridge University Press.

Kuperman, Victor & Joan Bresnan. 2012. The effects of construction probability on word durations during spontaneous incremental sentence production. Journal of Memory and Language 66(4). 588–611.

Liaw, Andy & Matthew Wiener. 2002. Classification and regression by randomForest. R news 2(3). 18–22.

Matsuki, Kazunaga, Victor Kuperman & Julie A. Van Dyke. 2016. The Random Forests statistical technique: An examination of its value for the study of reading. Scientific Studies of Reading 20(1). 20–33. doi:10.1080/10888438.2015.1107073.

Poplack, Shana & Sali Tagliamonte. 2001. African American English in the diaspora. (Language in Society 30). Malden, MA: Blackwell.

Schliep, Klaus Peter. 2011. phangorn: Phylogenetic analysis in R. Bioinformatics 27(4). 592–593. doi:10.1093/bioinformatics/btq706.

Strobl, Carolin, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin & Achim Zeileis. 2008. Conditional variable importance for random forests. BMC Bioinformatics 9(1). 307. doi:10.1186/1471-2105-9-307.

Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis & Torsten Hothorn. 2007. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(1). 25. doi:10.1186/1471-2105-8-25.

Strobl, Carolin, James Malley & Gerhard Tutz. 2009. An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods 14(4). 323–348. doi:10.1037/a0016973.

Szmrecsanyi, Benedikt, Jason Grafmiller, Joan Bresnan, Anette Rosenbach, Sali Tagliamonte & Simon Todd. 2017. Spoken syntax in a comparative perspective: The dative and genitive alternation in varieties of English. Glossa: a journal of general linguistics 2(1). 86. doi:10.5334/gjgl.310.

Szmrecsanyi, Benedikt, Jason Grafmiller, Benedikt Heller & Melanie Röthlisberger. 2016. Around the world in three alternations: Modeling syntactic variation in varieties of English. English World-Wide 37(2). 109–137. doi:10.1075/eww.37.2.01szm.

Tagliamonte, Sali. 2013. Comparative sociolinguistics. In J. K. Chambers & Natalie Schilling (eds.), The Handbook of Language Variation and Change, 130–156. 2nd ed. Chichester: John Wiley & Sons.

Tagliamonte, Sali & Harald Baayen. 2012. Models, forests and trees of York English: was/were variation as a case study for statistical practice. Language Variation and Change 24(2). 135–178. doi:10.1017/S0954394512000129.

Tagliamonte, Sali A., Alexandra D’Arcy & Celeste Rodríguez Louro. 2016. Outliers, impact, and rationalization in linguistic change. Language 92(4). 824–849. doi:10/gdg6vt.

Wolk, Christoph, Joan Bresnan, Anette Rosenbach & Benedikt Szmrecsanyi. 2013. Dative and genitive variability in Late Modern English: Exploring cross-constructional variation and change. Diachronica 30(3). 382–419. doi:10.1075/dia.30.3.04wol.

Wright, Marvin N., Andreas Ziegler & Inke R. König. 2016. Do little interactions get lost in dark random forests? BMC Bioinformatics 17(1). doi:10/b5t7.

Zipp, Lena & Tobias Bernaisch. 2012. Particle verbs across first and second language varieties of English. In Marianne Hundt & Ulrike Gut (eds.), Mapping Unity and Diversity World-Wide: Corpus-Based Studies of New Englishes, 167–196. Amsterdam: John Benjamins.